Wines Points prediction

Here we will try to predict the points a wine will get based on known characteristics (i.e. features, in ML terminology). The main point at this stage is to establish a simple, ideally very cost-effective, baseline, to which models with increased complexity and resource demands will be compared. In the real world there is a tradeoff between complexity and performance, and the DS job, among others, is to present a tradeoff table of what performance is achievable at what complexity level. Complexity should then be translated into cost. For example:

Loading the data
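A minimal loading sketch (the actual CSV path depends on your setup; here a tiny in-memory sample stands in for the wine reviews file):

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for the full wine reviews CSV;
# in practice you would pass the file path to pd.read_csv instead.
csv_sample = io.StringIO(
    "country,description,points,price,title,variety\n"
    "Italy,Aromas include tropical fruit,87,,Nicosia 2013 Vulka Bianco (Etna),White Blend\n"
    "Portugal,This is ripe and fruity,87,15.0,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red\n"
)

df = pd.read_csv(csv_sample)
print(df.shape)                    # (2, 6)
print(df["price"].isnull().sum())  # 1 -> 'price' has a missing value
```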

Data Exploration

Here we will try to better understand the data: its content, missing values, interesting correlations, etc.

We see that some columns contain nulls.

Title

It seems each row corresponds to a specific wine bottle i.e. each row is a unique bottle. We can verify that by first:
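The check can be sketched as follows (on toy data; the notebook runs this on the full dataframe):

```python
import pandas as pd

# Toy frame with one repeated title (column names follow the dataset)
df = pd.DataFrame({
    "title": ["Wine A 2010", "Wine B 2011", "Wine A 2010"],
    "points": [88, 90, 88],
})

n_rows = len(df)
n_unique_titles = df["title"].nunique()
print(n_rows, n_unique_titles)  # 3 2 -> titles are NOT unique

# Inspect rows sharing the same title
dupes = df[df.duplicated("title", keep=False)]
print(dupes)
```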

We see that the number of unique titles (118,840) is smaller than the number of rows, which means our assumption above was wrong. Let's look at rows having the same title:

So we see a problem in the data:

Depending on the ML problem we want to solve, and the magnitude of the problem, we might or might not fix those issues.

Adding a new column for year (extracted from title)

Inside the title we can often find the year in which the wine was produced. We'll extract that year and add it to the dataframe as a new column.
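One way to do the extraction (a sketch; the exact regex and fallback rules are assumptions based on the conclusions below, where out-of-range values become zero):

```python
import pandas as pd

df = pd.DataFrame({
    "title": [
        "Nicosia 2013 Vulka Bianco (Etna)",
        "Quinta dos Avidagos 2011 Avidagos Red (Douro)",
        "Rainstorm Pinot Gris (Willamette Valley)",  # no year in title
    ]
})

# Pull the first 4-digit run from the title; keep only plausible
# vintages (1800-2022), otherwise fall back to 0.
year = pd.to_numeric(df["title"].str.extract(r"(\d{4})")[0], errors="coerce")
df["year"] = year.where(year.between(1800, 2022), 0).astype(int)
print(df["year"].tolist())  # [2013, 2011, 0]
```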

Points

Let's start by looking at the points column. From the above we see it has no missing values. Some initial insights regarding the points distribution:

Price

Country

What is the average score and price per country?
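This is a straightforward groupby (illustrated on toy data):

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Italy", "Italy", "France", "France"],
    "points": [87, 89, 92, 90],
    "price": [20.0, 30.0, 50.0, 70.0],
})

per_country = (
    df.groupby("country")[["points", "price"]]
      .mean()
      .sort_values("points", ascending=False)
)
print(per_country)  # France: points 91.0, price 60.0; Italy: 88.0, 25.0
```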

Designation, province, regions, winery

Correlation between price and points

But prices depend very strongly on the country, while points (at least in theory; this assumption should be tested) do not. So:
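One way to act on this is to compute the price-points correlation within each country instead of globally (a sketch on toy data, where each country has its own price scale):

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Italy"] * 4 + ["France"] * 4,
    "points":  [85, 87, 89, 91, 85, 87, 89, 91],
    "price":   [10, 14, 18, 22, 40, 55, 70, 85],
})

# The overall correlation mixes in country-level price differences...
overall = df["price"].corr(df["points"])

# ...so also compute it separately per country.
within = {c: g["price"].corr(g["points"]) for c, g in df.groupby("country")}
print(round(overall, 3), within)
```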

Correlation between year and points

Remove price outliers
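A common choice is the 1.5*IQR rule (the exact rule used in the notebook is an assumption):

```python
import pandas as pd

df = pd.DataFrame({"price": [10, 12, 15, 18, 20, 22, 25, 3300]})

# Keep rows within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
clean = df[df["price"].between(lo, hi)]
print(len(df) - len(clean), "outliers removed")  # 1 outliers removed
```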

Data exploration conclusions

1. The original dataset contains 129,971 records.

2. Columns in the dataset (non-null count and dtype):

       1   country                129908 non-null  object
       2   description            129971 non-null  object
       3   designation             92506 non-null  object
       4   points                 129971 non-null  int64
       5   price                  120975 non-null  float64
       6   province               129908 non-null  object
       7   region_1               108724 non-null  object
       8   region_2                50511 non-null  object
       9   taster_name            103727 non-null  object
       10  taster_twitter_handle   98758 non-null  object
       11  title                  129971 non-null  object
       12  variety                129970 non-null  object
       13  winery                 129971 non-null  object

3. After removing the duplicated records there were 118,840 records left (meaning 11,131 duplicate records).

4. The title attribute in some cases contains the year in which the wine was manufactured. All values between 1800 and 2022 were kept; other values were set to zero in the new year column.

5. The statistics for points (no null values were found for points):

       count    119988.000000
       mean         88.442236
       std           3.092915
       min          80.000000
       25%          86.000000
       50%          88.000000
       75%          91.000000
       max         100.000000

6. The statistics for price:

       count    111593.000000
       mean         35.620747
       std          42.103728
       min           4.000000
       25%          17.000000
       50%          25.000000
       75%          42.000000
       max        3300.000000

7. There is a certain correlation between price and points: more expensive wines usually get a higher points score.

8. Older wines are less common in the dataset; mostly, older wines receive a higher score.

9. After removing outliers in price there were 109,016 records left (9,824 outliers found).

Points prediction

Points is a discrete numeric target; therefore we are dealing with a regression problem (as opposed to a classification problem). Regression solutions can be measured with a few metrics:

Read more here

Train and test set split

To properly report results, let's split to train and test datasets:
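A standard split with scikit-learn might look like this (the 80/20 ratio and random seed are assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the wine dataframe
df = pd.DataFrame({
    "country": ["Italy", "France", "US", "Spain"] * 5,
    "price": range(20),
    "points": [85 + i % 10 for i in range(20)],
})

X = df.drop(columns="points")
y = df["points"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 16 4
```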

Baselines

Baseline 1

The most basic baseline is simply the average points. The implementation is as simple as:
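For instance, predicting the training-set mean for every test row and scoring it with MAE:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Toy targets standing in for the real train/test points
y_train = np.array([86, 88, 90, 92])
y_test = np.array([87, 91])

# Predict the constant train mean for every test row
y_pred = np.full_like(y_test, y_train.mean(), dtype=float)
mae = mean_absolute_error(y_test, y_pred)
print(mae)  # 2.0
```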

Baseline 2

We can probably improve by predicting the average score based on the origin country:
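A sketch of a per-country average predictor, with a global-mean fallback for countries unseen in training (the fallback choice is an assumption):

```python
import pandas as pd

train = pd.DataFrame({
    "country": ["Italy", "Italy", "France"],
    "points": [86, 88, 92],
})
test = pd.DataFrame({"country": ["Italy", "France", "Chile"]})

country_mean = train.groupby("country")["points"].mean()
global_mean = train["points"].mean()

# Map each test row to its country's mean; unseen countries ('Chile')
# fall back to the global mean.
pred = test["country"].map(country_mean).fillna(global_mean)
print(pred.tolist())  # [87.0, 92.0, 88.666...]
```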

Baseline 3

Adding more breakdowns will increase our granularity but can result in overfitting. Yet:
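For example, a (country, variety) breakdown with a fallback chain for combinations unseen in training (the exact breakdown columns are an assumption):

```python
import pandas as pd

train = pd.DataFrame({
    "country": ["Italy", "Italy", "Italy", "France"],
    "variety": ["Red", "Red", "White", "Red"],
    "points": [86, 88, 90, 92],
})
test = pd.DataFrame({
    "country": ["Italy", "Italy", "France"],
    "variety": ["Red", "Rose", "White"],  # two combos unseen in train
})

group_mean = train.groupby(["country", "variety"])["points"].mean()
country_mean = train.groupby("country")["points"].mean()
global_mean = train["points"].mean()

# Fallback chain: (country, variety) -> country -> global mean
keys = list(zip(test["country"], test["variety"]))
pred = pd.Series(keys).map(group_mean.to_dict())
pred = pred.fillna(test["country"].map(country_mean)).fillna(global_mean)
print(pred.tolist())  # [87.0, 88.0, 92.0]
```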

Summary

Training a Boosting trees regressor

Preparing data - Label encoding categorical features
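Label encoding maps each category to an integer; a minimal sketch with scikit-learn:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"country": ["Italy", "France", "Italy", "Spain"]})

# Classes are sorted alphabetically, then mapped to 0..n-1
le = LabelEncoder()
df["country_enc"] = le.fit_transform(df["country"])
print(df["country_enc"].tolist())  # [1, 0, 1, 2]
print(list(le.classes_))           # ['France', 'Italy', 'Spain']
```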

Re-splitting to train and test

Fitting a tree-regressor
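The notebook fits an XGBoost regressor; as a library-agnostic sketch of the same idea (and in case xgboost is not installed), here is scikit-learn's GradientBoostingRegressor on synthetic data:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic features/target loosely tied to price, for illustration only
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "price": rng.uniform(5, 100, n),
    "country_enc": rng.integers(0, 5, n),
    "year": rng.integers(1990, 2020, n),
})
y = 80 + 0.1 * X["price"] + rng.normal(0, 1, n)

model = GradientBoostingRegressor(n_estimators=100, random_state=0)
model.fit(X, y)
preds = model.predict(X)
print(preds.shape)  # (200,)
```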

Let's look at the function output - specifically the xgb_clf_points_prediction column:

Summary

Classical NLP approaches

Description field only

Creating methods that will be used for handling text properly

Using only the text from the "description" column

Train and test split

NLP model creation
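A classical-NLP sketch of such a model: TF-IDF features feeding a linear regressor (the vectorizer settings and choice of Ridge are assumptions, not necessarily the notebook's exact setup):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for the 'description' column
descriptions = [
    "ripe fruit aromas with a smooth finish",
    "harsh tannins and a bitter aftertaste",
    "elegant ripe berry notes, long finish",
    "thin and watery, short finish",
]
points = [92, 83, 94, 81]

# TF-IDF (unigrams + bigrams) into a ridge regressor
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), Ridge(alpha=1.0))
model.fit(descriptions, points)
print(model.predict(["ripe fruit, long finish"]))
```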

Summary

Using both the text and other features

TBD

Train and test split

Preparing data - Label encoding categorical features

Create sparse matrix for all features - after encoding

NLP model creation

Summary

Deep Learning approaches

Fully connected network on the text only

Tokenization

What is a good size for the vocabulary?
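One practical way to answer this is to check how many of the most frequent words are needed to cover most token occurrences (a sketch on a toy corpus; the 90% coverage target is an assumption):

```python
from collections import Counter

descriptions = [
    "ripe fruit aromas with a smooth finish",
    "ripe berry notes and a long smooth finish",
    "thin and watery with a short finish",
]

# Count word frequencies, then ask: how many of the most frequent
# words cover 90% of all token occurrences?
tokens = [w for d in descriptions for w in d.split()]
counts = Counter(tokens)
total = sum(counts.values())

coverage, vocab_size = 0, 0
for word, c in counts.most_common():
    coverage += c
    vocab_size += 1
    if coverage / total >= 0.9:
        break
print(vocab_size, "words cover 90% of tokens")  # 12 words cover 90% of tokens
```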

Modeling

Simple NN Prediction & Evaluation

Summary

Description words concatenated

Summary

Using external embedding (description feature only)

Follow https://keras.io/examples/nlp/pretrained_word_embeddings/

You can either average the description word embeddings, concatenate them, or do both and compare.

Create external matrix using GloVe extract file
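Building the embedding matrix can be sketched as follows (a toy dict stands in for the parsed GloVe file; in practice each line of e.g. glove.6B.100d.txt is "word v1 v2 ... v100"):

```python
import numpy as np

# Toy stand-in for parsed GloVe vectors (3-d instead of 100-d)
glove = {
    "fruit": np.array([0.1, 0.2, 0.3]),
    "finish": np.array([0.4, 0.5, 0.6]),
}
embedding_dim = 3

# Word index as produced by a tokenizer (index 0 reserved for padding)
word_index = {"fruit": 1, "finish": 2, "tannin": 3}

# Rows for words missing from GloVe ('tannin') stay all-zero
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    vec = glove.get(word)
    if vec is not None:
        embedding_matrix[i] = vec
print(embedding_matrix.shape)  # (4, 3)
```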

Create embedding layer for model

Comparing the results we get when trainable=True in the embedding layer

Summary

Using LSTM or RNN Layer, (description feature only)

Summary

Bonus (not mandatory): Use all features via the Keras functional API

See here: https://keras.io/guides/functional_api/

Final Conclusions

  1. Results summary:

The best prediction results were obtained with NLP on the text only, with very close results for NLP with all features (including the year extracted from the title).

Convert to HTML